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Abstract 

This article proposes a Bayesian approach to regression with a scalar response 
against vector and tensor covariates. Tensor covariates are commonly vectorized prior 
to analysis, failing to exploit the structure of the tensor, and resulting in poor esti¬ 
mation and predictive performance. We develop a novel class of multiway shrinkage 
priors for the coefficients in tensor regression models. Properties are described, includ¬ 
ing posterior consistency under mild conditions, and an efficient Markov chain Monte 
Carlo algorithm is developed for posterior computation. Simulation studies illustrate 
substantial gains over vectorizing or using existing tensor regression methods in terms 
of estimation and parameter inference. The approach is further illustrated in a neu¬ 
roimaging application. [] 

Keywords: Dimension reduction; multiway shrinkage prior; magnetic resonance imaging 
(MRI); parafac decomposition; posterior consistency; tensor regression 

1 Introduction 

In many application areas, it is common to collect predictors that are structured as a mul¬ 
tiway array or tensor. For example, the elements of this tensor may correspond to voxels 
in a brain image (Lindquist, 2008 Lazar, 2008 Hinrichs et al., 2009; Ryali et al., 2010). 
Existing approaches for quantifying associations between an outcome and such tensor pre¬ 
dictors mostly fall within two groups. The first approach assesses the association between 
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each voxel and the response independently, providing a p-value ‘map’ (Lazar, 2008). The 
p-values can be adjusted for multiple comparisons to identify ‘significant’ sub-regions of the 
tensor. Although this approach is widely used and appealing in its simplicity, clearly such 
independent screening approaches have key disadvantages relative to methods that take into 
account the joint impact of the overall tensor simultaneously. Unfortunately, the literature 
on simultaneous analysis approaches is sparse. 

One naive approach is to simply vectorize the tensor and then use existing methods for 
high-dimensional regression. Such vectorization fails to preserve spatial structure, making it 
more difficult to learn low-dimensional relationships with the response. Efficient learning is of 
critical importance, as the sample size is typically massively smaller than the total number of 
voxels. Alternative approaches within the regression framework include functional regression 
and two stage approaches. The former views the tensor as a discretization of a continuous 
functional predictor. Most of the literature on functional predictors focuses on ID functions; 


Reiss and Ogden (2010) consider the 2D case, but substantial challenges arise in extensions 


to 3D due to dimensionality and collinearity among voxels. Two stage approaches first 
conduct a dimension reduction step, commonly using PCA, and then fit a model using lower 


dimensional predictors (Caffo et ah, 2010). A clear disadvantage of such approaches is that 
the main principal components driving variability in the random tensor may have relatively 
limited impact on the response variable. Potentially, supervised PCA could be used, but it 
is not clear how to implement such an approach in 3D or higher dimensions. 


Zhou et al. (2013) propose extending generalized linear regression to include a tensor 


structured parameter corresponding to the measured tensor predictor. To circumvent diffi¬ 
culties with extensions to higher order tensor predictors, they impose additional structure 
on the tensor parameter, supposing it decomposes as a rank-/? parafac sum (see Section 
2.1). This massively reduces the effective number of parameters to be estimated. They 


develop a penalized likelihood approach where adaptive lasso penalties may be imposed on 
individual margins of the parafac decomposition, focusing on good point estimation for the 
tensor parameter. However, their method relies heavily on cross-validated methods for se¬ 
lecting tuning parameters, with choices for these parameters being sensitive to the tensor 
dimension, the signal-to-noise ratio (degree of sparsity) and the parafac rank. 

Of practical interest is a “self calibrating” procedure which adapts the complexity of the 
model to the data. We propose a principled method to effectively shrink unimportant voxel 
coefficients to zero while maintaining accuracy in estimating important voxel coefficients. 
Our framework gives rise to the task of model-based rank selection, with carefully constructed 
shrinkage priors that naturally induce sparsity within and across ranks for optimal region 
selection. In addition, the need for valid measures of uncertainty on parameter (predictive) 
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estimates is crucial, especially in settings with low or moderate sample sizes, which naturally 
motivates our Bayesian approach. Our approach differs from image reconstruction literature 


as we do not model the distribution of the tensor X (Qiu 2007). It also differs significantly 


from Bayesian tensor modeling literature in which the response is an array/tensor (Dunson 


and Xing, 

2009 

Bhattacharya and Dunson, 

2012) 


2 Tensor regression 

2.1 Basic notation 

Let f3 l = (j3n, • • •, Pip J / and f3 2 = (/? 2 i, • • •, P 2 P2 )' be vectors of length pi and p 2 , respectively. 
The vector outer product (3 1 o (3 2 is a pi x p 2 matrix with (i,j)-th entry Pufaj- A D -way 
outer product between vectors /3- = (Pji, ■ ■ ■ ,Pj Pj ), 1 < j < D, is a p\ x • • • x po multi¬ 
dimensional array denoted B = (3 1 o f3 2 o ■ ■ ■ o f3 D with entries (B) iu __ niD = Y]? =1 f3ji r Define 
a vec(-B) operator as stacking elements of this D -way tensor into a column vector of length 
nf = , Pj■ From the definition of outer products, it is easy to see that vec(/3 1 o/3 2 o • • • o (3 D ) = 
(3d /3i- A D -way tensor B 6 1 has a Tucker decomposition if it can be 

expressed as 


Ri R2 Rd 

B = E E E v,.o o ■ ■ ■ o (i) 

ri = l r2=l r D =l 

. ) . , D D 

where f3j 3 is a dimensional vector, 1 < j < D, and A = (K 1 ,...,r D ) r i,].. ’r D r Li i s referred to 

(V) 

as the core tensor. If one considers {/3 ■ 3 ,1 < rj < Rj, 1 <3 < D} as “factor loadings” and 
Ki,...,r D 1° be the corresponding coefficients, then the Tucker decomposition may be thought 
of as a multiway analogue to factor modeling. 

A rank-/? parafac decomposition emerges as a special case of Tucker decomposition ([Tj) 
when R\ = R 2 — ■ ■ ■ — Rd = R and A n ,...,r D = I( r 1 = r 2 = ■ ■ ■ = r D ). In particular, 
-B g assumes a rank-/? parafac decomposition if 

B = 0...0/3W (2) 

r=l 

where ' is a p 3 dimensional column vector as before, for 1 < j < D and 1 < r < R. These 
vectors are often referred to as ‘margins.’ The parafac decomposition is more widely used 
due to its relative simplicity. 


3 













2.2 Model framework 

Let y £ y denotes a response variable, with z E X <zW and X £ scalar and tensor 

predictors, respectively. We assume response y follows an exponential family distribution 

f(y\0,r) = exp ( ^ 6 + c (y, T )^j (3) 

with natural parameter 0, dispersion r > 0 and known functions a(r), 6(0) and c(y, r). Usual 
GLMs focus on vector predictors z and let g(E(y\z)) = a + z' 7l for a strictly increasing 
canonical link function g(-) and model parameters a £ 3ft, 7 £ To generalize this 
framework to also include tensor predictor X, we let 


g(E(y\z, X)) = a + z ' 7 + (X, B), (X,B)= vec(X)'vec(B) 


(4) 


where B £ <g) j =1 W j is the tensor parameter corresponding to measured tensor predictor X. 
For concreteness, we focus on linear regression with g the identity link. 

The coefficient tensor B has n^Li Pj elements, necessitating substantial dimensionality 
reduction. A rank-1 parafac decomposition assumes B = {3 1 o ••• o 0 D and vec (B) = 
(3 d < 8 ) • • • 0 /3 1 . This reduces to modeling g(E(y\z, X)) = a + z ' 7 + (3\X0 2 when D — 2, 


corresponding to the bilinear model considered in Hung and Wang (2013). Since only the 


single parameter vector /3- captures signal along the jth dimension, a rank-1 assumption 
severely limits flexibility ruling out interactions among dimensions. 


Following Zhou et al. (2013), we use a more flexible rank-i? parafac decomposition for 

£ , 1 < j < D, and 1 < r < R. 


B = Ylr=i @1 * o ''' o 0 ( 'n introduced in (J2]) with ] 
Expression Q then becomes 


R 


g(K(y\z,X)) = a + z' 7 + (x, 0 • • • 0 0 


>S? 


r =1 


-a + z' 7 + {X)i u ...,i D {B) iu _ 

(ii,— j*d) 


ID 


(5) 


where voxel ( y X) il _ ^ D of the tensor predictor has corresponding parameter 




R D 

(*!>■■•’* D ) eV B = 

r= 1 j =1 


( 6 ) 


The model is therefore nonlinear in the parameters defining B. A hierarchical specification 
is completed by placing appropriate priors over unknown model parameters. Existing priors 
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may be chosen for a and 7 , but specification of the prior over the tensor parameters is 
nontrivial; see Sections and [3] and 

Under the assumed rank- 1 ? decomposition for B, model (J5]) requires estimating p + 
R 1 Pj as opposed to p + nf = i Pj parameters for the unstructured vectorized (saturated) 
model. One wonders whether such dramatic dimension reduction retains sufficient flexibility. 
In particular, we are interested in identifying geometric sub-regions of the tensor across 
which the coefficients are not close to zero, with the remaining elements being very close 
to zero. We would also like to accurately estimate the coefficients in these sub-regions. We 
have observed good performance in addressing these goals in extensive simulation studies 
summarized Section [ 6 j consistent with our theoretical analyses in Section |4j 

2.3 Model identifiability 

From model (|5]) it is clear that only voxel-level coefficients are identified and not the in¬ 
dividual tensor margins defining their product-sum given in ([ 6 ]). In the tensor setting, 
identifiability restrictions are understood in light of the following indeterminacies: 

1. Scale indeterminacy, for each r = 1 ,...,??, define A r = (Ai r ,..., Xor) such that 
njLi ^jr — 1- Then replacing (3^ by Aleaves the tensor parameter B unaltered. 

2 . Permutation indeterminacy: Yl^=i°f=i(3f^ = Ylf=i°f=i for any permutation 
P(-) of {1,2,..., R}. In particular, this implies that oU = 1 /3^ are not identihable for 
r — 1 ,..., R. 

3. Orthogonal transformation indeterminacy (D — 2 only): for any orthonormal matrix 
O, one has (/3^0) o (/3^0) = (3^ <g) (3^. 

For D > 2, imposing the following (D — 1 )R constraints ensures identifiability of the margin 
parameters comprising the rank-R parafac decomposition: 

3y = 1, 1 < j < D, 1 < r < R, and > ■■■> (7) 

For our proposed Bayesian method, we seek accurate estimation and inferences on B along 
with state-of-the-art predictive performance. Neither of these goals rely on identifiability of 
the tensor margins, [3^ ^, and hence we avoid identifiability restrictions on these parameters. 
The lack of restrictions simplifies the design of efficient computational algorithms. 


3.2 
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3 Multiway shrinkage priors 

3.1 Vector shrinkage priors 


There has been recent interest in high-dimensional regression with vector predictors, choos¬ 
ing priors which shrink small coefficients towards zero while minimizing shrinkage of large 
coefficients. Many of these priors can be expressed as a global-local (GL) scale mixtures 


(Poison and Scott, 2012) with 


6 j ~ N(0, V’j-t), ipj ~ g, t ~h, 


( 8 ) 


where (9 1 ,... ,6 p ) is a coefficient vector, r is a global scale and Tpj is a local-scale. When 
g is a mixture of two components, with one concentrated near zero and the other away 
from zero, a spike and slab prior is obtained. Many other choices of g and h have been 


considered. Although the GL family is widely used and versatile, Bhattacharya et al. (2014) 


note advantages in drawing the local scales jointly. In particular, they propose to let 


9j ~ DE(-|0jt), eftj ~ Dirichlet(a,..., a), r ~ h. 

where DE(-) denote the double-exponential distribution. For small a and large p, the 
Dirichlet(a,..., a) prior has the property of favoring many values close to zero with a few 
much larger values, but with ^ = 1 . 


3.2 Multiway priors 

We propose a new class of multiway shrinkage priors in the generalized linear model setting 
with tensor valued predictors. Assuming tensor parameter B admits a rank-i? parafac 
decomposition, model ([5]) results in voxel-level coefficients that are a nonlinear function of 
the corresponding tensor margin parameters (see ([6])). Moreover, this implies simultaneous 
shrinkage on each of the riy=i Pj voxe l coefficients as imposed by the prior over R Ylf=i Pj 
parameters. This necessitates careful prior specification on the tensor margins {/3^; 1 < j < 
D, 1 < r < R} such that the induced voxel-level prior has adequate tails so as to prevent 
over shrinkage. 

There are a number of desirable characteristics for a multiway prior on the tensor margins 
in the absence of prior information that certain elements of the tensor are more likely to be 
important. In particular, it is important to ensure that 

1. For each r — 1,... ,R, (dj^,,..., f3p\ D ) and (d^,...,d #\ D ) are equal in distribution, 
for any (i u ... , i D ), {h, ..., k D ) G V B x V B and (ii,...,i D ) ^ {h, ■ ■ ■, k D ). 
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2 . Shrinkage towards a low rank decomposition, with the model adapting to the complex¬ 
ity and signal in the data, effectively deleting unnecessary dimensions. 

3. The prior should favor recovery of contiguous geometric subregions of the tensor across 
which the voxel observations are predictive of the response. 

In addition, the proposed multiway shrinkage prior must have a structure that facilitates 
efficient and reliable model fitting. 


3.3 The multiway Dirichlet GDP prior 

There are many ways of specifying priors over tensor margins /3) r to satisfy the listed 
criteria. In this article we propose a particular choice which we deem the multiway Dirichlet 
generalized double Pareto (M-DGDP) prior. The M-DGDP prior induces shrinkage across 
components in an exchangeable way, setting r r = (j) r T as the global scale for component 
r — 1,..., R, with r rv_/ Ga(a T , b T ) and <f> = (0i,..., (j) R ) Dirichlet (op,..., an). In addition, 
define Wj r = diag^^i,..., Wj r>Pj ) for 1 < j < D and 1 < r < R as local (margin, 
component-specific) scale parameters. The hierarchical margin-level prior is given by 


(3 


(r) 


N(0, (4> r r)W jr ), w jrjk ~ Exp(A^ r /2), X jr ~ Ga(a A , b x ). 


(9) 


Additional flexibility in estimating B r = {/3j 1 < j < D} is accommodated by modeling 

heterogeneity within margins via element-specific scaling w jr ,k.. A common rate parameter 
Xj r encourages sharing of information between the margin elements. Collapsing over the 
element-specific scales, P^l\Xj r , (ft r , r ~ DE(Aj r /y / 0 r r), 1 < k < pj. Prior © leads to a 


GDP prior (Armagan et ah, 2013) on the individual margin coefficients. 


3.4 Prior hyper-parameter elicitation 

It is important to assess how the shrinkage prior § on the margins impacts the induced 
prior on the voxel coefficients. Unfortunately, the distribution of the voxel-level coefficients 
@ is not available in closed form. However, the voxel-level variance under the M-DGDP 
prior is given by 


var (B, 




/ |' R D 

= El var yipS-I^A®. 

^ ^ r =1 j =1 


T 
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r (« 0 + d) 
r(a 0 ) b? 


(2C A ) D E„ 


R 


r =1 


where the last step follows from the MGF of a Gamma distributed random variable. The 
following Lemma provides lower and upper bounds on the variance, which are useful in 
hyperparameter elicitation. 


Lemma 3.1 Under M-DGDP shrinkage prior 0 and for D > 1, if a\ = ■■■ = = 

c/R, c G N +; and with constants C\ = b\/ ((ay — l)(dy — 2)) ; a\ >2 , A r = exp ((D 2 3 — 
3D)/2), then the voxel-level variance is bounded below by Raf(2C\/b T ) D and above by 
A T (2C\/b T ) D exp(aqiLD). 


Proof: See Appendix [TJ 

Hyperparameters in the Dirichlet component of the multiway prior (J9]) play a key role 
in controlling dimensionality of the model, with smaller values favoring more component- 
specific scales r r « 0, and hence collapsing on an effectively lower rank factorization. Figure 
[l] plots realizations from the Dirichlet distribution when R = 3 for different concentration 
parameter^] a. As a f 0, points increasingly tend to concentrate around vertices of the S R ~ 2 
probability simplex, R > 1, leading to increasingly sparse realization^} 



(a) ct = (0.2, 0.2,0.2) 


(b) a = (0.3,0.3,0.3) 


(c) a = (0.5,0.5,0.5) 


Figure 1: Visualization of points in the S 2 probability simplex for 500 independent realiza¬ 
tions of X ~ Dirichlet (a). 


We allow a to be unknown by choosing a discrete uniform prior over a grid A, which we 
choose to be 10 values equally spaced between R~ D and i?~ 010 as a default. Armagan et al. 


(2013) study various choices of (a\,( — b\/a\) that lead to desirable shrinkage properties, 
such as Cauchy-like tails for (5^1 while retaining Laplace-like shrinkage near zero. Empirical 


2 For simplicity we have assumed «! = ••• = an = a. _ 

3 This notion of sparsity is made precise in Yang and Dunson (2014). 












results from simulation studies across a variety of settings in Section [6] reveal no strong 
sensitivity to choices for hyper-parameters a\,b\. From Lemma 3.1, setting a\ = 3 and 
b\ = 2 \/a\ avoids overly narrow variance of the induced prior on Table 1 and 

Figures 2-3 illustrate the induced prior on the tensor elements under our default choices. 



R 

5% 

25% 

50% 

75% 

95% 



1 

0.001 

0.011 

0.057 

0.254 

1.729 

D 

= 2 

5 

0.004 

0.040 

0.164 

0.595 

3.332 



10 

0.005 

0.058 

0.237 

0.852 

4.635 



1 

0.000 

0.001 

0.010 

0.072 

0.917 

D 

= 3 

5 

0.000 

0.009 

0.061 

0.341 

3.382 



10 

0.001 

0.017 

0.111 

0.608 

5.996 


Table 1: Percentiles for \ under the M-DGDP prior with default a\ = 3, b\ = 2 ^/a\, 

b T = aR l / D (v = 1) and a = 1 /R. Statistics are displayed as the dimension D of the tensor 
and its parafac rank decomposition R vary. 


R = 1 


R = 5 


R= 10 


D = 2 





D = 3 





Figure 2: Induced voxel level prior distribution for default specification as a function of a \, 
with b\ = 2 y/a\, b T = aR}! D and a = 1 /R. 


4 Posterior consistency for tensor regression 


4.1 Notation and framework 


We establish convergence results for tensor regression model (12) under the following sim¬ 


plifying assumption^] (i) the intercept is omitted by centering the response; (ii) the error 

4 Simplifying assumptions are merely to ease notation and calculations; all results generalize in a straight¬ 
forward manner. 
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Figure 3: Induced voxel level prior distribution for default specification for a A G {2,3,5}, 
b\ G {1, 3}, with b T = aR l ^ D and a — 1/R. Here, R = 10 and D — 2; note that b T = 2 ^/a\ G 
(1.18,1.50) for the range of a\ considered. 


variance is known as <7$ = 1; and (iii) fixed effects are known as 7 = (0,..., 0). We consider 
an asymptotic setting in which the dimensions of the tensor grow with n. This paradigm 
attempts to capture the fact that tensor dimension YljPj,n is typically substantially larger 
than sample size. This creates theoretical challenges, related to (but distinct from) those 
faced in showing posterior consistency for high dimensional regression (Armagan et ah, 2013) 


and multiway contingency tables (Zhou et al. 2014). 


Suppose the data generating model is in the assumed model class ( 12 ), i.e., having true 
tensor parameter B Q n G 


r = 1 5R Pj,n , error variance erg = 1 and 


R 


B° = £ (fig 0-0 (fig, (fig = OS'S,..., fjgj € W*». 

r =1 

In addition, define F ni F° n G as the vectorized parameters: 


F n = vec (/3^, • • 
F° n = vec(0° ( 


. q( r ) ... / 3 d) 

iMl.ru iMd,„ 


■■■ 3 {r) ) 

1 1 PD,n) 


* 0 ( 1 ) 

,n ? 


iq 0(R) 

' ' 1 Pl,n 1 


/aO(l) fl 0(fl)N 
1 PD.ni ' ' ' ’ Pr 


’D,n ) ' 


Dehne a Kulback-Leibler (KL) neighborhood around the true tensor B° n as 

f 1 n 

B n = \B n : - KL (f(yi\ B °), f(Vi\B n )) < e 

l n ti 

Denote KL(f( yi \B° n ), f(y t \B n )) as KL,. Since KL, = \ {(X h B° n ) - (.X. n B n ))\ the KL 
neighborhood of radius e around B° n can be rewritten as B n = {B n : X {{Xi,B° n ) — 
(X u B n )) 2 < e}. Further let 7 i n and n„ denote prior and posterior densities with n obser- 
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vations, respectively, and 


Jgc f {y n \^n)'^n(F' n ) 

_ n _ 

J nVn\B„)*n(F n ) 


with y n = (yi,... ,y n )' and f(y n \B n ) the density of y n under model (12). Posterior consis¬ 
tency is established by showing that 


Lf„ (£>(j) —* 0 under B° n a.s. as n —» oo. 


( 10 ) 


4.2 Main result 


Our main theorem is that (10) holds under a simple sufficient condition on the prior. 


Theorem 4.1 Let C n = rr^ (p 3 > 0), M n = KlEWX i|||. Given Lemma 


2—1 


7.1 


for any 


e > 0, n n (B n : i ELi KL, ; > e) —> 0 a.s. under B [) n , for the prior 7 r n (B n ) that satisfies 


^ n ( n • 11 B r 


Bl 


< 


2 T) 


> exp(— dn), for all large n 


( 11 ) 


for any d > 0 and y < — d. 


Lemma [7 . 1| verifies the existence of exponentially-consistent tests. The proof of the Lemma 
and theorem are provided in the Appendix. The proposed multiway shrinkage prior satisfies 
( |IH ) and hence leads to posterior consistency under the following theorem. 

Theorem 4.2 For fixed constants Hi, H 2 , Mi, pi, and p 2 > 0, the M-DGDP prior ([9]) yields 
posterior consistency under conditions: 

(a) Hin Pl < M n < H 2 n p2 

(b) sup, =lr . , iPjn |/3" 21 < Mi < oo, for all j — 1,..., D] r = l,..., R 

(c) Ef=iPj>log(py„) = o(n). 


Remark Theorem 


4.2 


require that E,'=dTj,™ grows sub-linearly with sample size n. 
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5 Posterior computation and model fitting 

Letting y G 3ft denote a response, and 2 G X G ® predictors, we let 

y\X,'y,B,a^N(z' 7 + (X,B),a 2 ) 

R 

B = Y^ Br, B r = (3$ ® • • • (8) (3 [ ; ] (12) 

r=l 

O' 2 ~ 7T ct , 7 ~ 7T 7 , (3 P) ~ Trp. 

The noise variance is given a conjugate inverse-gamma prior, a 2 ~ IG(u/2, uSq/ 2), with Sq 
chosen so that Pr(cr 2 < 1) = 0.95. This is done assuming the response is centered and scaled, 
which also removes the need for an intercept; by default we set v = 2. Fixed effects are given 
conjugate normal prior 7 ~ N(0,cr 2 S 07 ), with rescaled prior covariance. Finally, voxel data 
for the tensor predictor are standardized to have mean zero and variance 1 , allowing one to 
assume default values for hyper-parameters of the proposed multiway priors. 


5.1 Posterior computation 

The proposed multiway M-DGDP prior @ leads to efficient posterior computation for 
tensor regression model (12). We rely on marginalization and blocking to reduce auto¬ 
correlation for { (p)! r \w jr ; l < j < D,1 < r < R), (<P,r), ( 7 , cr) j, drawing in sequence from 
(i) [a,$,r\B,W]] (ii) [B, W |<f>, r, 7 , a, y]; and (iii) [ 7 ,<r\B,y\. Step (i) is non-trivial and 
we propose an efficient way to sample this block of parameters compositionally. This is 
essential for good mixing under the M-DGDP prior. Step (ii) is sampled using a sequence 
of draws from full conditional distributions using a back-fitting procedure to iterate draws 
from margin-level conditional distributions across the components. 

( 1 ) Sample [a,$,r\B,W] = [ct\B, W][$, r\a,B,W]; 

(a) Sample from the conditional distribution of Dirichlet concentration parameter [a\B, XV] 

via griddy-Gibbs: form a reference set by drawing M samples from [<&,T\a, B, XV] 
for each a G A. Set w 3 j = n(B\a, <3>z, 7 , W) vr(<Fz, T/|a), 1 < l < M |M|, p(a\B, XV) = 
7r ( Qi ) E/^i 41 and Pr (« = a j\~) = P( a j\ B ,XV)/'£ loil - A p(a\B, W )- 

(b) Next, the rank-specific scales are sampled (see Appendix [7]) as [<E>, r\a*, B, XV] = 


[$\B,W][t\$,B,W]i define p 0 = Y^=i Pji an d recall a T = a r — Ra and 
b T = a(i?/u) 1/D (see Section |3.3[ ) , then 

giG(ct — po/2, 26 r , 2C r ), C r = Y^=\ , and set = 


'R 


draw 0 r 

Vv/E£ 7 0; in parallel for 1 < r < R 
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• draw r ~ giG(a r - Rp 0 /2 ,2 b T , 2 A-)> D r = C r /(p r 

(2) Sample |(/3 j'\wj r , A jr ); l<j<-D,l<r< i?} |<f>, r, 7, cr, ?/. For r = 1,..., R and j = 
cycle over [((3f\w jr ,\ jr )\P^],B_ T ,$,T,'y,(T,y], where [3 ( l] = {/3 z (r) , l =/=■ j} 
and B r — B \ B r \ 


(a) draw [w jr , X jr |/3j r) , 0 r , t] = [w jr \ X jr , /3$ r) , 0 r , r] [A ir |/3j r) , </> r , r]: 

• draw X jr ~ Ga(a A + pj,b\ + WPf'Wi/V&rr)-, and 

• draw Wj rj k ~ giG(|, X 2 r , Pjk/( < f ) r T )) independently for 1 < k < Pj 

(b) draw /3$ r) ~ N(/i jr , E ir ): define = E*=w D =i J ( d i = fc ) ®di,..,d D (IL# 0$) > 


= & 


(r) 


,(D 


E ir = (H? t H? 


j-p,)', Vi = Vi - z'il - J2i^r( x *’ B i) for 1 < i < n; then 

9 , iir-l // I \\~ 1 » i T-r(r) ~ / 9 


/<r 2 + w^KM) 


1 L^jr ^jrHj 


v/° 


(3) Sample [7, a\B, y] = [7] a, y][a 2 \y]; define & = yi - (X h B) for 1 < i < n, then 

(a) draw a 2 ~ IG(a CT ,6 CT ), a a = [n + v)/2, b a = (vs% + \\y\\\ - y 1 Z/x 7 )/ 2 

(b) draw 7 ~ N (/x 7 , <j 2 S 7 ), S 7 = (Z T Z + E^ 1 ) -1 , m 7 = S 7 Z T y. 


6 Simulation studies 


To illustrate hnite-sample performance of the proposed multiway priors, we show results 
from a simulation study with various dimensionality (p, R) and define b = max | B° ix io \ as 
the maximum signal size. Throughout, set pj = p, (Tq = 1 and b = 1 for convenience. I11 
addition, we set 7 0 = (0,..., 0) and focus exclusively on inference for tensor parameter B. 
The following simulated setups are considered: 


1. “Generated” tensor: We construct tensor parameters having rank Ro = {3, 5} with p = 
{64,100} and D = 2. 

2. “Ready made” tensor: We use three tensor (2D) images without generating them from a 
parafac decomposition with known rank. 


Five replicated datasets with n = 1000 are generated according to (12) with x 




N(0,1). The tensor parameters considered are shown in Figure |4j where the magnitude of 
the non-zero voxels is b — 1. Examples are chosen to demonstrate recovery of voxel-level 
coefficients across varying degrees of complexity (dimension, parafac rank) and sparsity {% 
of non-zero voxels; see Figure [4]). The performance of our method with M-DGDP prior ([9]) is 


compared with (i) frequentist tensor regression (FTR)(Zhou et ah, 2013); and (ii) Lasso (on 
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the vectorized tensor predictor). Comparisons are based on (a) voxel mean squared estima¬ 
tion error (true non-zero, true zero, and overall); and (b) frequentist coverage (and length) 
of 95% credible intervals. MCMC was run for 1000 iterations, with a 200 iteration burn-in 
and remaining samples thinned by 5. FTR uses R = 5, selecting the tuning parameter over 
a grid of values to minimize RMSE. To choose initial values, a preliminary analysis was run 
with a coarsened 16 x 16 imag^J In all simulation experiments, we observed rapid mixing 
for the proposed MCMC algorithm. 

Voxel-level RMSE reported in Table [2] demonstrates that our method (M-DGDP) consis¬ 
tently out performs FTR. When the tensor parameter has a low-rank parafac decomposition 
(‘R3-ex’ and ‘R5-ex’), M-DGDP and FTR perform best, with M-DGDP having lower RMSE 
on both true zero and non-zero voxels. This validates empirically prior Q along with our 
suggested default hyper-parameter choices (see Section [3]); in particular, M-DGDP adapts 
to varying degrees of sparsity, shrinking many coefficients close to zero while accurately esti¬ 
mating nonzero voxels. FTR’s performance is sensitive to the performance of cross validation 
for parameter tuning, with the CV grid sensitive to tensor dimension (p, D) and rank R. 
In some cases, overall RMSE was lower for R = 3 even though performance in estimating 
non-zero parameters was worse than for other choices. 

Results in Table [3] demonstrate that M-DGDP yields credible intervals with good frequen¬ 
tist coverage across each of the simulated settings, both overall as well as on the true non-zero 
coefficients. Our method is one of the Erst to offer uncertainty quantification for tensor val¬ 
ued predictors, of critical performance in performing inferences on these parameters. Finally, 
Table [4] provides evidence of the robustness of our method to increasing predictor dimension 
using two of the simulated examples. In both cases, RMSE for FTR worsens considerably on 
the true zero coefficients. For the true nonzero voxels, RMSE increases for both methods as 
the margin dimension increases; on a relative % basis, however, FTR worsens considerably 
more, while on an absolute scale, M-DGDP remains the clear winner. 

7 Simulated response with a real 3D brain image 

We analyze data containing 3D MRI images for 550 adolescents, with information such as 
age and sex available. Age and sex are treated as ordinary scalar covariates while 3D MRI 
images act as tensor covariates. Let X denote a 30 x 30 x 30 3D MRI image, Z\ be the age and 
Z 2 be the sex of an individual. The response is simulated using y rs-/ N(Z / 7+(X, J B 0 ),n 2 ), 
where Z denotes (Z 1} Z 2 )', 7 G 7 Z 2 and B 0 6 7^0x30x30. We assume the true B 0 is a rank-2 

5 On several of the simulated examples, this seems to help optimization, both in terms of runtime as well 
as in terms of RMSE at convergence. 
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(a) R3-ex (7.0%) 


(b) R5-ex (11.0%) 


(c) Shapes (6.8%) 



(d) Eagle (7.4%) (e) Palmtree (8.0%) (f) Horse (18.6%) 


Figure 4: Simulated data with 64 x 64 2D tensor images (p = 64, D — 2). Row 1: The 
first two images (from left) have a rank-3 and rank-5 parafac decomposition; the third image 
is “regular”, although does not have a low-rank parafac decomposition. Row 2: All three 
images are irregular, and do not have a low-rank parafac decomposition. Sparsity (% non¬ 
zero voxels) are displayed in sub-captions. 



R3-ex 

R5-ex 

Shapes 

Eagle 

Palmtree 

Horse 


M-DGDP 

0.023o.oo 

0.021o.oo 

0.243o.oi 

O.2260.02 

0.316o.oi 

0.278o.oi 

vox 0 > 0 

FTR 

0.035o.oo 

0.030o.oo 

0.415 o .o3 

O.354 0 .o3 

0.435o.o2 

0.391o.o3 


Lasso 

0.628o.o2 

0.822o.o3 

0.619o.or 

0.665 0 .o3 

0.698o.o3 

0.8880.01 


M-DGDP 

0.011 0 .oo 

0.014 0 .oo 

O.O71 0 .oo 

0.085o.oo 

0.100 Q .oi 

0.137o.oo 

|vox 0 | = 0 

FTR 

0.022o.oo 

0.020o.oo 

0.127 o .o2 

0.163o.o3 

0.159o.oo 

0.215 o .o2 


Lasso 

0.090o.oo 

0.098o.o2 

O.O8I0.01 

0.097 o .oo 

0.094 o .oi 

0.155o.o2 


M-DGDP 

0.013 

0.015 

0.093 

0.102 

0.131 

0.172 

Overall 

FTR 

0.023 

0.021 

0.164 

0.184 

0.196 

0.257 


Lasso 

0.187 

0.288 

0.179 

0.204 

0.217 

0.407 


Table 2: Comparison of voxel estimation as measured by root mean squared error (RMSE) 
for the six 2D tensor images portrayed in Figure |4j Results from FTR (Zhou et al. 2013) 
use R = 5. By default, R = 10 is used in all M-DGDP runs. 


(Zhou et al. 2013 


tensor, B o = b\ o b 2 ° f>3 + o-i ° «2 ° 0-3- Initialization and standardization of predictors follow 
exactly as prescribed in Section [5j 

The following cases are examined by varying a,’ s and b^s: 
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(a) R3-ex 


10 20 30 40 50 60 

(b) R5-ex 


10 20 30 40 50 60 

(c) Shapes 



-o.; 


10 20 30 40 50 60 

(d) Eagle 

Figure 5: Recovered images for 
is used in all M-DGDP runs. 



10 20 30 40 50 60 


10 20 30 40 50 60 


(e) Palmtree (f) Horse 

the 64 x 64 2D tensor images in Figure |4j By default, R = 10 



R3-ex 

R5-ex 

Shapes 

Eagle 

Palmtree 

Horse 

voxo > 0 

coverage 

0.986o .02 

0.946o.o2 

0.747 0 . 0 i 

0.731o.o4 

0.677o.o4 

0.795 0 .o2 

Overall 

coverage 

length 

0.995o.oi 

0.066 0 .oi 

0.970o.oi 

O.O 6 I 0.01 

0.965o.oo 

0.290q.oo 

0.940o.o2 

0.301q.o3 

0.948o.o2 

0.410o.o3 

0.927o.oi 

0.566o.o2 


Table 3: M-DGDP coverage statistics on generated and ready-made 2D tensor images with 
simulated tensor predictor data. 




1 voxel 

R5- 

ex 

Shapes 



64 

100 

64 

100 


coverage 

> 0 

0.946 0 .o2 

0.991o.oi 

0.747 0 .oi 

0.590o.o6 

M-DGDP 

length 

> 0 

0.061 0 .oi 

0.069 0 .oi 

0.290o.oo 

0-247 0 .oi 

rmse 

> 0 

0.02lo. 00 

0.032(3.oi 

0-243o.oi 

0.320o.o3 


rmse 

= 0 

0.014 0 .oo 

0.014 0 .oo 

0.071o.oo 

0.063o.oo 

FTR 

rmse 

> 0 

0.030o.oo 

0.369 0 .o6 

0.415 0 .o3 

0.586 0 .i4 

rmse 

= 0 

U.020o.oo 

U.lllo.02 

U.127 0.02 

U.l3ho.o2 


Table 4: Sensitivity analysis of voxel estimation error (RMSE) as the tensor dimension 
increases; here pj — p e {64,100} for the 2D tensor images ‘R5-ex’ and ‘Shapes’. 
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Case 1: b 3 — b 2 = (0,..., 0, sin((l : 15) * 7 r/ 4 )), b 3 = (sin((l : 10) * 7 r/ 4 ), 0,..., 0), 
a\ = (0,..., 0, sin((l : 10) * 7 r/ 4 )), a 2 = (0,..., 0, cos((l : 15) * 7 r/ 4 )), 
a 3 = (sin((l : 15) * 7 r/ 4 ), 0,..., 0). 

Case 2: = b 2 = (0,..., 0, sin((l : 15) * 7 r/ 6 )), b 3 = (sin((l : 20) * 7 r/ 6 ), 0,..., 0), 

cii = (0,..., 0, sin((l : 15) * 7 r/ 4 )), a 2 = (0,..., 0, cos((l : 10) * 7 r/ 6 )), 
a 3 = (sin((l : 15) * 7 r/ 6 ), 0,..., 0). 

Case 3: b 3 — b 2 = (0,..., 0 ,sin((l : 20) * 7 t/6 )), b 3 = (sin((l : 20) * 7 r/ 6 ), 0,..., 0), 
ai = (0,..., 0, sin((l : 10) * 7 r/ 4 )), a 2 = (0,..., 0, cos((l : 20) * 7 r/ 4 )), 
a 3 = (sin((l : 20) * 7 r/ 6 ), 0,..., 0). 

These cases correspond to sparse B 0 , with 12%, 18% and 30% nonzero elements, respec¬ 
tively. We implement M-DGDP, FTR with R = 5 fixed, and Lasso with tensor vectorized. 
Both FTR and Zhou et al. (2013) include an L\ penalty (results are shown for the best 
choice of penalty), which can over-shrink voxel coefficients significantly different from zero. 
M-DGDP instead includes heavy tail to prevent such over-shrinkage. This is evident from 
the better performance of M-DGDP prior in terms of estimating nonzero coefficients, see 
Table 5. Table [5] summarizes RMSEs for the estimated tensor coefficients for each of the 
competitors. In each of the above simulations, trace plots for several model parameters 
were monitored and found to mix well using the proposed MCMC algorithm in Section [5] 
Overall, M-DGDP prior performs 10 — 15% better for cases considered in this section. In less 



Case 1 

Case 2 

Case 3 


M-DGDP 

0.39 

0.30 

0.34 

| vox 0 1 > 0 

FTR 

0.46 

0.41 

0.43 


Lasso 

0.46 

0.42 

0.44 


M-DGDP 

0.04 

0.14 

0.10 

I vox 0 1 = 0 

FTR 

0.00 

0.00 

0.00 


Lasso 

0.01 

0.03 

0.02 


M-DGDP 

0.13 

0.20 

0.17 

Overall 

FTR 

0.15 

0.22 

0.18 


Lasso 

0.15 

0.23 

0.18 


Table 5: Comparison of voxel estimation as measured by root mean squared error (RMSE) 
for the coefficients in case 1,2, 3 corresponding to 3D tensor images. Results from both 
M-DGDP and FTR (Zhou et al. 2013) use R — 5. 


sparse cases, it is also evident that M-DGDP tends to outperform Li-optimized methods by 
a greater margin. Importantly, every parameter in M-DGDP is auto-tuned, while L 3 penalty 
results in vastly different performance with varying choices of the tuning parameter. 

While M-DGDP consistently shows coverage over 95% with reasonably short credible 
intervals (see Table [ 6 j) , Li-optimization based methods generally suffer in this regard. For 
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Case 1 

Case 2 

Case 3 

Coverage 

0.98 

0.96 

0.99 

Length 

0.54 

0.87 

2.16 


Table 6 : Coverage and length for 95% credible intervals for M-DGDP. 



Case 1 

Case 2 

Case 3 

7 i : truth = 

0.5 

M-DGDP 

FTR 

0.57 

0.46 

0.54 

0.85 

0.33 

0.95 

72 : truth = 

2 

M-DGDP 

FTR 

2.00 

1.87 

2.04 

0.22 

1.86 

3.30 


Table 7: Point estimates of the coefficients for age and sex for M-DGDP and (Zhou et al. 


2013) along with the true values. 


completeness, we provide point estimates and credible intervals for coefficients corresponding 
to age and sex in Table [TJ The data analysis reveals superior performance of M-DGDP with 
proper characterization of uncertainties. 


Appendix 

MCMC algorithm 

The following derivations concern the M-DGDP prior (J9]) and the sampling algorithm out¬ 
lined in Section 15 .1 i 


For step (lb) Recall from Section 3.3 that r ~ Ga(a r , b T ) and <f> ~ Dirichlet(oi,..., or) 


and denote p 0 = Pj- Then, 


7r(<f)|.B, a;) oc 7r(<f>) / 7r(i?|u;, 4>, r)7r(r)(ir 


R 


OC 


oc 


n 

r =1 
R 

n 

r =1 




' o 


n [( r ^) 

r= 1 


-Po/2 


exp 


1 

T(f) r 


^2\\Pjr\\ 2 /( 2 ^jr)) ^ 1 exp(— 6 r r)dr 


3 =1 


„ £o I 
'■S*r 2 ^ 


T a T -Rf -1 


R Q 

exp (- -j- - b r (T(p r ) j dr 

r=l ' 


with C r = Y^-j=i Wflj r \\ 2 /{ 2 (jJ jr)■ When a T = Ylf=i a r, this contains the kernel of a gener¬ 
alized inverse Gaussian (gIG) distribution for (r0 r ). Recall: X fx(x) = gi G(p,a,b) oc 
x p ~ l exp (—(ax + b/x)/ 2). Following Lemma 7.9, for independent random variable T r ~ f r 
on (0, oo), the joint density of {c f> r = T r /^2~Tf : r = 1,... ,R} has support on <S R_1 . In 
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particular, 


/ oo 

dt, <j) R =l ~^0r- 

r=l r<i? 

Substituting f r (x ) oc x~ 5r exp(— C r /x) exp(— b T x) in the above expression yields 

/ oo R / C \ 

t r -i JJ(0 r . r )-' 5 ’- exp - b T ((p r T)jdT 


R 


[n 

r =1 


R 


' Sr 


r R ZrSr 1 ! ! exp 


II 

r =1 


Cr 


(0 r r) 


b r ((j) r T ) \dr. 


Matching exponents between this expression and the preceding one implies (1) a T — R(po/2) — 
1 = R — S r — 1, and (2) S r = 1 + p 0 /2 — a r . Then, 


a T — i?(l + Po/2 ) — 'y ^ 8 r — R(1 + po/2) — (R + Rpo/2 — y ^ ot r ) — y ^ ot r 

r r r 

as previously noted. Hence, draws from [<h| a,B,W] are obtained by sampling T r ~ f r — 
giG(ct r — po/2, 2b T , 2C r ) independently for r = 1,..., R, and renormalizing. 


Proof of lemma 13.1 

y2 

Proof Using priors defined in (J9J), one has C\ = Ea(1/A 2 ) = ^ lx _i^ ax _ 2 ) f° r an y °a > 2. In 
addition, the following inequalities are useful to bound the latter quantity: 

• If oi = c/R, c G N+, r(a 0 + -D)/r(«o) = tto(«o + l) ■ • • (ao + D — 1). Using the fact that 

log(a: + 1) < x, x > 0, one has log(o 0 ) H-h log(a 0 + D — 1) < a 0 D - 1 + Y^k=ivD -2 

Then a® < T(o: 0 + D)/T(a 0 ) < H r exp(a 0 il) where A r = exp(—1 + J2k=ivD- 2 ^) = 
exp((D 2 -3D)/2), D >2. 

• Trivially, 11*1112 < 1; in addition, by Holder’s inequality, for any x 6 and 0 < r < p, 
one has ||x|| p > k~(r~p) \ \ x \\ r . In our setting, D >2. Taking r = 1 in the latter yields 

11*1112 > r-( d ~ 1 \ 

Recall a 0 = « r = a±R. This leads to the lower and upper bounds for the prior 

voxel-level variance: 

var(R il ,... jiD ) > {2C x ) D {a l R) D R-^/b R = (2 C x )°a?R/b? 
var (B ilr „ iia ) < A t {2C x ) d exp(aq RD)/b °. 
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Consistency proofs 

The proof of Theorem |4.1| relies in part on the existence of exponentially consistent tests. 

Definition An exponentially consistent sequence of test functions <f> n = I{y n G C n ) for 
testing H 0 : B n = B° n vs. Hi : B n ^ B° n satisfies 


E B o ($n) < ci exp(-6in), sup E B „(1 - <£> n ) < c 2 exp (-b 2 n) 


for some c±, c 2 , b i, b 2 > 0. 


Lemma 7.1 Suppose RJ2f=iPj,n — °{ n )> then there exist an exponentially consistent se¬ 
quence of tests <£> n for testing H 0 : B n = B° n vs. Hi : B n ^ B[[. 


Proof We begin by stating that —2 (l(B^) — l(B n )) 


critical region of the test as C n = \ B 


W under B„. We choose the 

^Ej=i Pj,n 71 


~ l(B n ))| > e/4}. Note that 


E s; (4>„) = P B o ( - -(((BJI) - l(B n )) > e/4 ) = P B o (x 


R T,j=lPj,r 


> ne /4 


< exp 


for large n, 


where the last line follows by simplifying Laurent and Massart (2000) and using P(xl > 
x) < exp(— x/A) if x > 8 p. 

Now we will use the fact that 


-mi) - ub „»= - y (» - <. x h B n » 2 - i y (» - <w-b “» 2 

n n z —' n z ' 

3 = 1 J =1 

1 n 2 n 

--V ((x„ B„ - Bl)f + - - (X„ B„))(X„B° - B n ) 

n ' n z ' 

i =1 i =1 

1 n 2 n 

= - V XL, + - V(</, - <*i, Bn))(X ; , - B n ). 

n z —' n z —' 

2=1 2=1 

Note that, under B n , 

n n 

- - PC B° - B n ) ~ iV(0, - AW,). 
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Thus, 


sup E b „( 1-4„)= sup P B A\-(l(Bl)-l(B n ))\<e/i 

B n eBs. B n eBs, \ n 


= sup P Bn 

B n &B^ 

< sup P Bn 

B n £j3Z._ 


< sup P Bn 

B n ^BS._ 


< sup P Bn 

Bn&BS,, 


n 


-(l(B n )-l(B° n )) 

n 


-(l(B n )-l(B n )) 

n 


< e/4 


e/4 < 


n 


fi ±kl, + M^P 

\ n V n Jn 

\ i=i 


n 


y kl, 


Z— 1 


EIU KU z 


n jn 


e/4 < 
e/4 < 


n 


-(l(B n )-l(B n )) 

n 


Where Z ~ iV(0,1). Let T n = 
sup E Sri (l-$ n ) 


EL , KLj z 
n Jn 


— Ji =i Using this fact we have 


< sup P Bri 

Bn&Bn 


n 


J2 KL i + 


i —1 


z 


n \ n 


-e/4 < 


n 


(l(B n )-l{B n )) 




< 


i= 1 


n 


(l(B n ) — l(B n )) 


+ sup P Bn 

Bn&Bn 


< P B n ( ^ < x 


(l(B n ) — l(B n )) 


+ p B n ( \Z\ > -y/ne 


4 A ^ E , = l P.j r 


„ / 9 ne\ ( 3 ne\ / ne\ / ne\ 

+ P B , (x, > T ) < exp ( - — J + exp (--) < 2 exp (--) , 



+ sup 

SnGB,;; 

1 

W\ 

n 

Z=1 

< 2 exp 

(-*)• 


where the last line requires an application of Laurent and Massart (2000). 

Theorem 14.11 

Proof Under Lemma [7. II one has 

S B . f(Vn\B,)7T n (F n ) f e . 


n „m = 


f{y n \Bl) ‘ 
fjyJBn), 




//(l/ n |B B )7T B (F B ) /|g, n (f n ) 

Note that we have 

P B ° («,. > exp(— bin/2)) < E b o (<3> n ) exp(&!n/2) < c\ exp(— bin/2) 
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Therefore E^=i P b“ ($* > exp(— bin/2)) < oo. Using Borel-Cantelli lemma 
Pb° (^n > exp(— b\n/2)i.o.) = 0. It follows that 

<f> n —* 0 a.s. (13) 


In addition, we have 

E s j((l-4>„)iV) = /(l-4>„) 


f(y n \Bn, 

I" 

n/ 


* n (F n )f(y n \B °) 



(1 - $ n )f(y n \B n )TT n (F n ) 

< Slip E Bji (1 - 4> n ) < c 2 exp(-fe 2 n). 

Using a similar technique as above, P B o ((1 — < L ri )lVexp(nfe 2 /2) > exp(—nfe 2 /4)bo.) = 0 so 


exp(fen)(l — $ n )IV —* 0 a.s.. 


(14) 


By Lemma j7d| and (13)-(14) it is enough to show that M = exp(fen) f ^"j^o \ ^n{F n ) —» cxi 


f(Vn\ B n) 


for some fe < fe = We choose b = b. Consider the set "H n = <{ B n : 4. l 0 g 
for some r) which is chosen later. 

M > exp (fen) [ exp f-nj) {^" ) n n (F n ) 


Ry.ABn) 

f(y n \B°) 


<Vh 


Hr, 


n f(Vn\ B n) 


> exp((6 - rj)n)n n (Ti n ). 


Note that 


liogl UUgd 


n 


-o X> - < X *' S »» 2 + 5 X> - S °» : 


i=l 


i =1 


Let y n = (j/i,y n )', i/n = «*!, B n ),..., (X n , B n )) and H° n = ((X 1 , B° n },..., (X n , B° n )). 
Then 


TTr 


Bn : - [-11 Vn - H° n II 2 + II y n - H n || 2 ] < 2y) 

2||y n - if°|| (||y n - if n || - ||y n - H° n ||) + (||y n - ff n || - ||y n - tf°||) 2 | < 2r?) 


^ TTn I B n. • 
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> TTn(B n : i |2||i/ n - H° n \\\\H° n - H n \\ + \\H n - H ° n || 2 | < 2r?) 

> 7T n (iJ n : 1 ||if° - H„|| < ^r, || V n - H ° n || 2 < ( 2 ) 

\ n 3( n J 

> TT n (Ain (~l .A. 2 n) 

where A„ = - Hj|| < ^}, A 2n = {||®„ - Hj|| 2 < (,?}• 

We will show that P B o(A 2 n ) = 1 for all large n. Assume C, n = n^ l+P3 ^ 2 , p% > 0 so that 
( 2 > 8 n for all large n. Then, 


P B °( A 2n ') = P B° (Xn > Cn) < eXp(~Cn/ 2 )- 


Therefore, using Borcl-Cantelli lemma P B o(A' 2n i- 0 .) = 0. Hence P B o {Ain) — 1 f° r all large 
n. It is enough to bound 7r„(Ai n ). Let M n = 11-^*1 II- Now use the fact that 


i\\H n - Hi II = (»,S„ - B“)) 2 < |B„ - B»|| 2 to conclude 

2 Tj 


\B n -Bl\\ 2 < 


3M n ( n 


c Ain- 


(15) 


By (11) one has 7r n (Ai n ) > 7r n (j|.B n — i?°|| 2 < 3A ^h j > exp(—dn) and hence M > 


exp ((6 — r) — d)n ) —>• oo as n —* oo proving the result. 

Theorem 14.21 
Proof Dehne g : M —> M s.t. 

0(«) = ^ ^ 52 11 itfln ll2 + -- - + K 5Z5Zril I 

j= 1 r=l j=l r=l l^j 


0(r)i 


Let K n > 0 be s.t. g(n n ) = zm'c • Note that by Decarte’s rule of sign, the equation 

2 V 

3M n Cn 


3M„Cn 

<7 (ft) — afr = 0 h as a unique positive root. Further 


— < 1 + max 

txrn i=l 


3E^E? =1 n hot* 

Z=1 


2i)/M„C„ 


(16) 


K n < 1 + max < 


( 

2 T] 



_ _ . IlldiA 

3M n ( n R i=i,...,D 

R 


< 


> 


(17) 
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by Lemma 7.6 


Using Lemma 7.2 it is easy to see that 


£! - /S" ( :’ll2 < «... 3 = 1. •••,£>; r = 1, C j||B„ - B% < . 


(18) 


Using ( 


Note that 


151, (||B„ - B% < A) > ». ({\\fll-0%% < j = r = 1, ...,*}). 


VT r 


({l|/3yn - 112 < Kn, .7 = 1, D\ r = 1, ..., i?} \{w jr ,i}^i, \ {^rj^Ll 1 , t) 


> 


' D R 

nn 

.7=1 r=l 


7T „ 


(r) a 0 (r) 


n l I \h'j,n 


P^nlU < A^r}r=l^) 


Therefore, it is enough to bound 7r n (||/3^ — \ < n n , j = 1 = 1 For 

J = 1,..., D, r = 1 


n n(\\P^l - P% ] \\ < ^n\{w jr ,l}^tlAjr, T ) 


\R-1 


> 


Pj,n 


rj,n / \ 

II " n I - ^=1 

\& + K n/Pj,r- 


hi Vv 7 ^ n lTVJj r j(j) 7 


exp 


r 


Hljr,Z0rl~ 


where the last step follows from the fact that J e x2 ^ 2 dx > e ( a2 + b2 )/ 2 (7) _ a y Thus, 


*n{\\P^n - pfiw < Knl^jr, {<t>r}?=l , t) 


R -1 


= E 


KniWP^l ~ P^ II < ^n\{w jr ,lYl=P\r: {Pr}r=i t) 

l/ffl 2 + «n/P; 


> 


> 


2 AC. 


Pj’ n Pj,n 

" 1 1U 


y/2p j>n n<j> r T 

/ 2n n \ 2 jr . y-r 

\ v 2 v /2p i , n 7T0 r r y f = j Jw :h ,t \y/Wj^l 


l =1 ( V w jr,l 

Pj,n Pj : n 


exp 

1 


3,n 


HI jr,l^r'P 

?°( r ) 12 I „-2 


exp 


l^.njP + ^nM.n 


HI } jr.lPr^~ 


dw 


jr,l 


(19) 


Use the change of variable —— = Zj r j and the normalizing constant from the inverse Gaussian 

jr,l 
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density to deduce 



1 ( l/ffil 2 + K 2 Jp j ,n A 2 jr W jr ,i 

ex P [--- -j— -^- ] aWjrj 


-Jw jr^l y Wj r i(p r T 


(\& 2 + K n/Pj,n) 


z j r ,i I \ / z 


: exp 


d. 



'jr,l 


exp 


r T 


Zjr,l 2 - I az i r l 


dZn r 


jr,l 


-A 



2 (\& + K n/Pj,n) ^ 


jr 


\f&T 


\ 


(19) can be written as 

^n(||M"n - P^nW < K n\X jr , {</V&\ t ) 


> 



\ 


Therefore, 

*n(\\P£l ~ P^nW ^ ^|{<Mf=lV) 


> 


2 


Pj,n 7 Cl \ r 

b x 


^ 2 \JPj : n*PrP~ 

2/^tt, 

^ ZV^frT 

2 K n 


r(a A , 


\ Pj.roS^A,?" 1 

A,y ’ exp 


»v J A, 


]r 


f 

\ 



Pj,™ I. a A,r- 
b X,r 


r (Pj,n “1“ ^A,r) 


r(a v ) 

Efir V 2 0°«,* i 2 +^m>) , , 

P.7,n - l - ^'A,r 


1 &A ’ r 



Pj,1 


r(pyn “I" ®A,r) 


2 h,ry/Pj,n<t> r T J r(a v ) 


ETii 

^A,r v0r^" 


+ 1 




+ &A,r 
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The final expression as in the above yields 


KniWPj^ - P^n 11 < Kn,j = 1 ,-,D,r = 1, R\{<f> r }"=/, t) 


ft-1 


> E < 


D R 

nn 

J=1 r=l 


2kx 


Pj,n 


A 


Pj,n+^A,r — 1 


r (]?j,n H - ^A,?" 


2 b\,ry/Pj,n(firT I ^(cL X ,r) ^ 


ESi" \/ 2 


j,n,l 


\ 2 +^l/pj. 




+ 1 


Pj,n+0, A, 5 


We will now use the fact that for 4> r < 1, 


> 


Efi:!* V 2 (l^iS | 2 +«n/Pi,n) 


"1 Pj,n+fflA,r — T 


b\,ry/ 


+ 1 




b\ : rV<t>rT 


Pj 


,n+«A,r Jre [°’ 1 h 


This inequality is critical to provide a lower bound on 7r n (| |/3^ —| < K n ,j = 1,..., D, r = 
1,f?) as following 


7r n(||/3^-^S ; || < Kn,j = 1 ,-,D,r = 


,0(r) 


> 


AjT (f?a) 




D R r 

nn 




nil 
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’<i>eS R 


sr^D Pj,n 
-1 TT-R ll-i mt ~T~ 


nti 




D R 

nn 

j =i '• i 


Pj,n 


b (pj.n T ®A,r 
r(a A ,r) 


r 




Ef=r \Z 2 (ii r ,ii 2 + K n/ft>) 


"I Pi,n+QA,r 


6a,i 


+ 1 


2 exp(-A 2 r) 


d4> (It 


X^rjRa) T-r T-r 



Pj ’ n Tfa, n + a x , r ) 
r (a x ,r) 


D R 


nn 


i 


Ei=i 



+ 1 


Pj,n+ a A,r 


T 


^l+E r =l a X,r~ 2 — 1 


T— 0 


exp(—rA 2 )rfr 



a + a A,r 2 ^ 

r 


dcp 


A 2 1 T(-Ra) |-r |-r 

nAO^llll 


K n 


\JPj,;nb\.r 


Pj,n 


F( Pj,n “1“ Q'\,r) 

r (a A ,r) 
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exp(-A 2 ) nti [r(a + avf yj 

(Ai + OA,ry) r(i2a + ^ ^ r _, CtA,r) 


Dellote C « = r(j‘,[r"a« (A.+g'tU) ) ' Then the above ex P ression g 1 ™ 15 

2r] 


us 




D R r 1 

< - l0g(C 6 ) + ^2^Pj,n - log(Kn) + ^ l °S('Pj.n) + log(&A,r) + log(r(a A ,r) 

0 — 1 o.— 1 L 


j =1 r i 


R D DR 

EE log(r(p nJ + a A ,r)) + X] 5Z(Pi,n + a A,r) log 

r=l j=l j=l r=l 


Y?& M\& + <bN 


+ 1 


°\,r 


( 20 ) 


Using (16) and assumption (b), it is easy to see that A- < G^n P2+Pi ^ IlyLi Pj,n for a constant 
G 5 > 0 for all large n. Therefore, Y^=\ E?=i Pj,n log + § log(p i>n ) + log(6 A ,r) + log(r(a A , r ) 
o(?r). Also, )T)f=l Ef=l log(r (Pj,n + aA,r))] < Ef=l(i°i,n + <**,»•) log(Pj> + <**,»■) = °( n )> b y aS ’ 


sumption (c). Finally, YJLi Y^=\iPj,n + a \r) log 


yR 


ETir \/2 (i^Si 2 +<M' 


+1 


= o(n), by 


assumptions (b) and (c). Thus, — log ^ ir n (B n : || B n — B ° n || 2 < 3A j ; ^ < dn f° r all d > 0, 

for all large n. This proves the result. 
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Supplemental Materials 

This supplement contains additional Lemmas relevant to the article, some of which are well 
known and presented without proof. 

Lemma 7.2 Suppose T — Ti o • • • o T D and F — Fi o • • • o F D are two rank one tensors of 
same dimension. Then 

D—l 

T — F = (Ti — Fi) o • • • o (To — Fo) + 7 i ° ° 7 o, 

1=1 XiUX 2 =l:Z?,|Xi |=^,|X 2 |=Z?—Z 

where 7 ,- = F 3 if j G X 2 ; = Tj - Fj if j G 1 1 . 

Proof We will show it by induction. If D — 2 then, 

T — F = T\ o T 2 — F\ o F 2 = (Ti — Xi) o T 2 + o T 2 — Px o X 2 

= (Ti — Fi) o (T 2 — F 2 ) + (Ti — Fi) o F 2 + Xi o (T 2 — F 2 ). 

Assume the result to hold for D — 1. For D, 

T\ o ■ ■ ■ o To — F\ o ■ ■ ■ o Fo 

= (Ti - Fi) o T-2 o ■ ■ ■ O To + Fi o [T-2 o ■ ■ ■ O To - F-2 o ■ ■ ■ O F D \ 

= (T\ — Fi) o [(T 2 — F 2 ) O • • • o (To — Fo) + F 2 o • • • o Fp+ 

D—2 

Y Y 72°---°7d] + 

1=1 XiUX 2 ,|Xi |=Z,|X 2 |=X>—1—z 

D—2 

F l o [(T 2 - F 2 ) o ■ ■ ■ o (To - Fo) + ^2 Y 7 2 o ''' o 7 d\ 

1=1 XiUX 2 ,|Xi|=Z,|X 2 |=X>—1—z 
D—l 

= (Fi —Fi) o • • • o (To- Fo) + Y 7 x 0 ••• 07 D \. 

1=1 XiUX 2 ,|Xi|=Z,|X 2 |=D-Z 

Hence proved. 

Lemma 7.3 Suppose 6 ~ N(0, S) udf/z £ p.d. and #o G JtF Xef ||0o||zz = 0 / o S^ 1 0 O - X/zen 
/or any t > 0 

e x P(~ ^°f H )P(m < t/ 2) < P(||0 - 6> 0 || 2 < t) < exp(- ^^ g )P(||fl || 2 < t). 


29 



Proof This is a general version of Anderson’s lemma. For more references see Van der Vaart 
& Van Zanten (2008). 


Lemma 7.4 If f \,..., fd are convex functions such that > 0 and f[ < 0 for all i = 
1 ,,d, then YYl=\ fi convex. 


proof of lemma 7.f 


Proof First we prove the result for d = 2. Note that (fif 2 )'' = f” f 2 + f 2 fi + 2f[f 2 > 0. So 
the result holds for d = 2. Also / 1/2 > 0 and (fif 2 )' = f[f 2 + f 2 fi < 0. 

Assume the result to hold for d — 1, i.e. nf=i fi is convex and (nf=i fi)' < 0- Then 



Lemma 7.5 Let gi(x) = (1 + x) v/2 and g 2 {x) = c x , c 2 , c 3 > 0. Then g 1 (x) < 1 + l v/2 

and g' 2 (x) > 0 for all x > 0 and v > 2. 


proof of lemma fL5 


Proof Let h\{x) = (1 + x) v ^ — (1 + xu/2), then h[(x) = (1 + xY^ 2 ~ l — v/2 > 0 for 
all x > 0, v > 2. Further using the fact that /?i(0) = 0, we conclude h\(x) > 0 for all 
x > 0, v > 2. This implies gi(x) < 1+ • The proof of g 2 > 0 for all x > 0 is similar and is 
omitted. 


Lemma 7.6 Let x* be a real root of the polynomial P{x) = akX k + ak-\x k 1 + • • • + a\x — a 0 . 
Then I/\x*\ < 1 + max i= i r .^ ^ 


proof of lemma T6 


Proof Consider the polynomial -Pi(C) = ( k 
of variable with ( = we obtain 




By making a change 



akX k + • • • + a\x — a 0 
aox k 


Note that Pi (^) = 0 is solved by x = x*. Therefore, P\ (() = 0 is solved by ( = T_. The 
result follows by using Cauchy bound on the roots of a polynomial. 


Lemma 7.7 Let x = (xi,... ,x p ) ~ F, where F is a multivariate density function. If 
hi ,..., h p > 0 be functions s.t. 9 lo ^A‘ r T) > ifo en \Yj=i hj( x j) a convex as a multivari¬ 
ate function over x. 
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proof of lemma 7. ? 


Proof Note that 


9 log (hj(xj)) 


h'-(x j) 9 

-rr^ =z? + a 

hj ( Xj ) J 


n a j — 


dx.i 


d 2 log(hj(xj)) 


h j( x j) j d 2 log (hj(xj)) _ h"(xj) _ ( ti 3 {xj) 
h(xj) dx 2 hj(xj) \hj(xj) 


This implies 


> 


0 and Zj = jpppj- Let H(x i,... ,x p ) = n^=i hj( x j)- Then 


X7 2 H(x) 


( h i(xi)Ylj^ihj{xj) h' 1 {x 1 )h' 1 (x 2 )Ujjti,2h j (x j ) 
K(xi )ti 2 {x 2 ) IU,2 h A X j) K{x 2 ) n^2 hj (Xj ) 


h 1 (xi)h p [xp)Y\j^ lp hj{xj) 9 
K(x 2 )h' p (x p ) Uj^ p hj(xj) 


V K(xi)h' p (x p )Y[ m , p h j {x j ) 

( 


— n 0? ^ 


h"{xi) 
hi(xi) 

h' 1 (xi)h 2 (x 2 ) 
hi(xi)h 2 (x 2 ) 


h[(xi)h' 2 (x 2 ) 

hi(xi)h 2 (x 2 ) 

h'zjxi) 

h 2 (x 2 ) 


h 'l( x l) h ' p ( x p) h p( x p) h 2 ( x 2 ) 

\ hi (xi)h p {x p ) h 2 {x 2 )h p {x p ) 


hp{xp) h ' 2 (x 2 ) Ujjt 2 ,p h j( x j) 

h l( x l) h 'p( x p) \ 
hi(x\)h^(xp) 
h ' 2 i x 2)h v {Xp) 
h 2 (x 2 )hp(x p ) 


hp(Xp) 

hp{x p ) J 


— n 0 ? ) 

3 

= n 0? ) 

3 


/ a\ + z\ z\z 2 


ZlZp \ 


Z\Z 2 a 2 + z\ 


Z 2 Zp 


\ ZiZ p Z 2 Z p 

••• a p + z 2 / 



1 z \ N 

1 

diag(ai, ... ,a p ) + 


(zi,... 

Zp) 


\ z p ) 



K( x p)T\.&p h i( x 3) / 


Therefore X7 2 H(x) is a positive de fini te matrix proving the lemma. 


Lemma 7.8 If K v (x) is the modified Bessel function of the second kind with parameter 
z/ G 9ft, then 


2 u ~ 1 T(u) > x v Kjx) > 2 V ~ 1 T(u)e~ x , 


for all x > 0 and v > 0. 

Proof For a detailed proof, see Gaunt (2014). 

Lemma 7.9 Suppose are independent random variables with Tj having density 

fj supported in (0, oo). Let <f>j = 3 T . Then the joint density of ((f)i, ■ ■ ■, <f> m -i) has a 

joint density supported on the simplex S m ~ l and is given by 

/•OO m 

f{<t> 1 , ---I 0 m- 1 ) = / ^ m_1 IJ hi^ > 3 t ) dt i 


where 0 m = 1 - Ya=i 0z- 


31 





























proof of lemma 7 .9 


Proof This result is well known in the theory of normalized random measures (Kruijer et al. 


2010 Zhou and Carin, 2013). 
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